Imbalanced Classification

Imbalanced classification addresses scenarios where the distribution of classes is heavily skewed, with one or more classes (minority) significantly underrepresented compared to others (majority). This is extremely common in real-world applications: fraud detection (fraudulent transactions might be <1% of total), medical diagnosis (rare diseases), spam filtering, anomaly detection, and fault detection in manufacturing. Standard classification algorithms trained on imbalanced data often develop a bias toward the majority class, achieving high overall accuracy while failing to identify minority class instances.

The accuracy paradox illustrates the core challenge: a classifier that always predicts the majority class achieves high accuracy but zero utility. If fraud occurs in 0.1% of transactions, predicting "not fraud" for everything yields 99.9% accuracy while catching zero fraud cases. This makes accuracy a misleading metric for imbalanced problems, necessitating alternative evaluation approaches focused on minority class performance.
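
To see the paradox in code, the sketch below scores an always-majority predictor; the 0.1% label vector and scikit-learn's DummyClassifier stand in for a real dataset and model:

```python
# Illustrative sketch of the accuracy paradox: a predictor that always
# outputs "not fraud" on labels with ~0.1% positives (synthetic data).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.001, size=100_000)   # ~0.1% fraud labels
X = np.zeros((y.size, 1))                  # features are irrelevant here

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)
print("accuracy:", accuracy_score(y, y_pred))  # ~0.999
print("recall:  ", recall_score(y, y_pred))    # 0.0, catches no fraud
```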

Data-level approaches modify the training distribution. Random oversampling duplicates minority class examples, risking overfitting to specific instances. Random undersampling removes majority class examples, potentially discarding useful information. SMOTE (Synthetic Minority Oversampling Technique) generates synthetic minority examples by interpolating between existing minority instances and their nearest minority-class neighbors in feature space. Tomek Links removes the majority-class member of cross-class nearest-neighbor pairs, and Edited Nearest Neighbors discards instances whose class disagrees with most of their nearest neighbors; both clean up the decision boundary region. Hybrid methods (e.g., SMOTE followed by Tomek Link removal) combine oversampling and undersampling.
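
As a minimal sketch of the data-level route, the example below rebalances a skewed synthetic dataset with SMOTE; it assumes the third-party imbalanced-learn package, and the 95/5 class split is invented for illustration:

```python
# Minimal SMOTE sketch (assumes imbalanced-learn is installed).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic binary problem with roughly a 95:5 majority:minority split.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes minority examples by interpolating between each
# minority instance and one of its k nearest minority neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now balanced
```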

Algorithm-level approaches modify the learning algorithm itself. Cost-sensitive learning assigns higher misclassification costs to minority class errors, encoded through class weights in the loss function; many algorithms (SVM, logistic regression, tree-based methods) support class weighting. Threshold adjustment shifts the decision boundary by changing the probability cutoff for the positive class (e.g., 0.2 instead of the default 0.5), trading precision for recall. Ensemble methods like BalancedRandomForest or EasyEnsemble combine resampling with ensemble learning.
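
The sketch below combines both ideas in scikit-learn, fitting a class-weighted logistic regression and then lowering the positive-class threshold; the synthetic data and the 0.2 cutoff are illustrative choices, not tuned values:

```python
# Cost-sensitive weighting plus threshold adjustment (illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" reweights each class by n_samples / (n_classes * class_count),
# so minority errors cost proportionally more in the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Threshold adjustment: flag positives at P(y=1) >= 0.2 rather than the
# default 0.5, trading some precision for higher minority recall.
proba = clf.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.2).astype(int)
print("positives flagged:", int(y_pred.sum()), "of", len(y_pred))
```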

Evaluation metrics for imbalanced classification prioritize minority class performance. Precision-Recall curves and Average Precision focus on positive class performance. F1-score (or the F-beta score with an adjustable recall weight) balances precision and recall. Matthews Correlation Coefficient provides a balanced measure even under imbalance. ROC-AUC can still be useful but tends to look optimistic on highly imbalanced data, because the false positive rate is diluted by the abundance of negatives. Per-class metrics (sensitivity, specificity, and precision for each class) reveal performance disparities hidden by aggregate metrics.
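
The sketch below computes several of these metrics with scikit-learn; the toy labels and scores stand in for a fitted model's test-set output:

```python
# Imbalance-aware evaluation on toy predictions (synthetic stand-ins
# for a real model's test-set labels, scores, and hard predictions).
import numpy as np
from sklearn.metrics import (average_precision_score, classification_report,
                             fbeta_score, matthews_corrcoef, roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=2_000)                 # ~5% positives
proba = np.clip(0.4 * y_true + rng.uniform(0, 0.6, 2_000), 0, 1)
y_pred = (proba >= 0.5).astype(int)

print("average precision:", average_precision_score(y_true, proba))
print("ROC-AUC:          ", roc_auc_score(y_true, proba))
print("F2 (recall-heavy):", fbeta_score(y_true, y_pred, beta=2))
print("MCC:              ", matthews_corrcoef(y_true, y_pred))
# Per-class precision/recall exposes gaps that aggregates hide.
print(classification_report(y_true, y_pred, digits=3))
```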

Anomaly detection represents an extreme form of imbalance where the minority class is exceptionally rare and may not be well-represented in training data. This often requires specialized one-class classification or unsupervised anomaly detection methods rather than standard supervised approaches.
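
As one unsupervised option, the sketch below flags outliers with scikit-learn's IsolationForest on synthetic 2-D data; the contamination rate is an assumed value, not something estimated from real data:

```python
# Unsupervised anomaly detection sketch: IsolationForest trained
# without labels on data that is overwhelmingly "normal" (synthetic).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(1_000, 2))   # bulk of the data
outliers = rng.uniform(-6.0, 6.0, size=(10, 2))  # rare anomalies
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of anomalies; predict() returns
# +1 for inliers and -1 for points flagged as anomalous.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)
print("flagged as anomalies:", int((labels == -1).sum()))
```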